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Abstract 

We consider character sequences evolving on a phylogenetic tree under the 
TKF91 model. We show that as the sequence lengths tend to infinity the the 
topology of the phylogenetic tree and the edge lengths are determined by any 
one of (a) the alignment of sequences (b) the collection of sequence lengths. 
We also show that the probability of any homology structure on a collection 
of sequences related by a TKF91 process on a tree is independent of the root 
location. 

Keywords: phylogenetics, DNA sequence evolution models, identifiability, align- 
ment 



1 Introduction 

One of the important mathematical problems in phylogenetics is formulated as follows. 
Suppose a collection of sequences (of nucleotides or amino acids) has evolved from a 
common ancestral sequence under a stochastic process of evolution on a phylogenetic 
tree. The process of evolution allows substitution of characters, insertion or deletions 
of segments of the sequence, and possibly other types of mutations. A natural ques- 
tion is: given current sequences (that is, sequences at the leaves of the tree), can we 
reconstruct the evolutionary relationship between them unambiguously? Here evolu- 
tionary relationship means the topology of the underlying phylogenetic tree and edge 
lengths, as well as other parameters that define the stochastic process, and possibly 
more information such as the homology relationships between individual characters. 
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One hopes that given long enough sequences, the evolutionary relationship between 
sequences is obtained accurately. 

There are several reasons for studying such models. Such models are a useful 
compromise between biological reality and computational tractability. They allow 
the use of maximum likelihood methods for estimating evolutionary relationships, 
and maximum likelihood methods can be proved to be consistent for many such 
models. 

Several earlier studies focused on pure substitution models only, that is, mod- 
els that didn't consider insertion-deletion events. In one of the earlier papers by 
Felsenstein pQ , an algorithm was proposed to compute the likelihood of a tree (given 
sequences) for a class of reversible substitution models. Many invertibility (unique 
identifiability of the underlying tree) results have already been known for classes of 
Markov substitution models. For example, in |2] and [5j it was shown that the tree 
topology and edge lengths are identifiable from the joint distribution of character 
states at pairs of terminal nodes under some mild conditions on the Markov transi- 
tion matrices on the edges of the tree. This result was proved using the following 
result due to Hakimi and Patrinos j3] about additive functions on trees. 

Lemma 1. Let T be a tree on the vertex set V . Let f be a non-zero real valued 
function defined on the set of subsets of V of cardinality 2, satisfying the additivity 
condition 

r-l 

f(i x >y}) = ^2f({^,x i+1 }) 

i=0 

where Xo,£i, . . . ,x r is the unique path in T connecting X — Xq and y = x r . Then 
the value of f on all pairs of leaf nodes of T determines uniquely the tree T and the 
function f . 

A restricted version of the above result for positive weights was proved by Zaretskii 
jH] and Buneman j^j. 

General identifiability results in case of four-state characters, which are of par- 
ticular interest for the phylogenetic analysis of DNA sequences, were proved using a 
mathematical technique known as Hadamard Conjugation, for example, see [3J |HJ E|. 
However, these results are based on certain symmetry assumptions on transition ma- 
trices. Strong results for general 12 parameter 4x4 Markov transition matrices have 
been proved by Chang ^Uj- He has proved that full reconstruction (under some mild 
restrictions on the Markov models) is possible by looking at character distributions 
at triples of leaf nodes. 

The above mentioned results do not consider insertion-deletion events. In 1991 
and 1992, Thorne, Kishino and Felsenstein fTJ ^] introduced models that allowed 
insertion-deletion events in addition to character substitution events. We refer to the 
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models by TKF91 and TKF92, respectively. The models consider sequence evolution 
as a birth-death process. In the TKF91 model, a birth event inserts a single character 
in the sequence, and a death event deletes a single character from the sequence. In 
the TKF92 model, a birth event inserts a segment of characters in the sequence, and 
a death event deletes a segment of characters from the sequence. They proposed 
dynamic programming alignment algorithms for a pair of sequences. Recently the 
models have been extensively studied, and many intricate algorithms for sequence 
alignment and phylogenetic tree reconstruction have been proposed. Although the 
algorithms based on the TKF framework are difficult to describe and to implement, 
they have many advantages, see for example, fSJEJEi!- It is therefore important 
to analyse the TKF models mathematically in the same way the symmetric and 
asymmetric substitution models were analysed by Hendy, Penny, Steel and Chang. In 
particular, it is worthwhile to know if these models are invertible and if the algorithms 
based on them are consistent. 

In this paper, we prove two invertibility results for the TKF91 model. We prove 
that an alignment of sufficiently long multiple sequences, (to be precise a homology 
structure on them,) contains enough statistical information to estimate the parame- 
ters of the birth-death process as well as the tree topology and edge lengths. We also 
show how to estimate the topology and model parameters from only the sequence 
lengths provided the sequences are sufficiently long. In the appendix, we prove that 
the probability of a general homology structure on a collection of sequences does not 
depend on the root location. This result is of theoretical as well as practical impor- 
tance. It has been used (without proof) to test implementations of algorithms in |14j . 
Thus this paper gives a rigorous mathematical analysis of the TKF91 model rather 
than proposing new algorithms. 

The paper is structured as follows. In Section |21 we give an overview of the 
TKF91 model, and demonstrate the main idea of this paper for a pair of sequences. 
In Section El we show the invertibility based on the homology structure, and in 
Section 0] we prove the invertibility based on sequence lengths. In the last section 
we show the independence of the root location and the likelihood of the homology 
structure, and also discuss future directions. 

2 The TKF91 model 

Let a DNA or an amino acid sequence be represented by an alternating string of links 
and characters. The characters are denoted by # throughout, since the actual base 
or an amino acid is not important in any discussion in this paper. At the left end of 
the string is an immortal link. All links undergo a birth process at a rate A. When a 
birth event occurs, a new character is placed to the right of the link, and a new link is 
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placed to the right of the new character. All characters undergo a death process at a 
rate fi. When a character dies, the character and the link to its right are deleted from 
the string. The new character born in a birth event is chosen from a distribution. 
While a character is alive, it undergoes a substitution process at a rate s, and the 
substituted character is chosen from a distribution. The substitution model and the 
model for the choice of the new character in a birth event are not relevant here, since 
the homology structure (information as to which characters are homologous to which 
others) is what interests us in this paper. Suppose that a sequence A evolves to a 
sequence B in time t. Since all characters at all times are assumed to be evolving 
independently of all other characters, we look at the evolution of a single character in 
A. We describe the model by three coupled differential equations and their solutions. 
We refer the reader to ^T] for details. 

Let Pn(t) be the probability that a character survives for time t, and at time t, it 
has n descendents (counting the survived character as one of the descendents). Here 
superscript H stands for "homologous". By definition, p^(t) = for n < 0. The 
differential equation for p^if) is 



dt 

with the initial conditions 



X(n - l)^ + ^np" - (A + fi)np^ (1) 



pf(* = 0) = 1 

p H(t = o) = for n > 1 (2) 

Let p 1 ^ {t) be the probability that the character in A dies before time t, but leaves n 
descendents in B. Here superscript N stands for "non- homologous" . By convention, 
Pn {t) = for < 0. The differential equation for p 1 ^ (t) is 



-f - = A(n - 1)^! + Kn + + - (A + p)np N n (3) 



with the initial condition 

p%(t = 0) = for all n (4) 

Let Pn{t) be the probability that the immortal link has n descendents at time t. 
Here superscript / stands for "immortal". By convention, p^t) = for n < 0. The 
differential equation for p^it) is 

= Anp^_ x + ii{n + l)p n+1 - A(n + l)p£ - /inp n for n > (5) 
with the initial condition 

plit = 0) = for all n > (6) 
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Lemma 2. Solutions to the above differential equations are given by 



P 



where 



*(t) = e-^(l-A/3(t))(A/3(t))"" 1 forn>0 

Pn = vP(t)forn = 

= (1 - e - "* - np(t))(l - \(3(t))(\(3(t)) n - 1 forn>0 

p J n = (l-A/3(t))(A/3(t))" forn>0 (7) 



1 _ P (A-|t)* 

0(f) 



// - Ae( A -^>* 

The main idea of this paper is the following observation about the solutions. 

Corollary 3. Let A and B be two sufficiently long sequences related by the TKF91 
model, A being the ancestor of B. Then model parameters (scaled time and the ratio 
of birth and death rates) are uniquely determined by the true alignment of A and B. 

Proof. Observe that e~ M * is simply the probability that a character in A survives in B, 
e~ M 'A/3(t) is ^2 n>2 Pn-> an d 1 — e ~ M * — ///3(t) is J2 n >iPn ■ This allows us to accurately 
estimate fit, X/3(t) and fi/3(t), and therefore X/fi, from an alignment of sufficiently 
long sequences. □ 

In the next section, this idea, together with Lemma ^ is applied to an alignment 
of several sequences. One has to show that probabilities of certain patterns in the 
alignment do not depend on the position of the root on the tree. Since not all patterns 
are required for estimating the pairwise distances between leaves, the independence 
of the probability of a pattern and the root position for general patterns is presented 
in the Appendix. 



3 Multiple sequences 

The main result of this section is the following. 

Theorem 4. A multiple sequence alignment of sufficiently long sequences that have 
evolved under a TKF91 process on a phylogenetic tree uniquely determines the phy- 
logenetic tree and the (scaled) edge lengths. 

Before presenting the full proof, it will be useful to look at the main ideas behind 
the proof. 

1. To be able to apply Lemma [I] to construct the tree, we would like to compute 
the distance between each pair of vertices. Here the distance between a pair of 
vertices is the (scaled) divergence time between the corresponding taxa. 
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2. The distance between a pair of vertices is estimated by the method of Corol- 
lary El This is done by restricting the multiple alignment to the two sequences 
under consideration and then equating the observed pattern frequencies with 
those expected from the solutions of the TKF model. 

3. Before we can equate the observed frequencies of patterns and the expected 
frequencies, we have to solve one important problem: in the discussion of the 
TKF91 model, we assumed that the sequence A is the ancestor of the sequence 
B. But if the sequences A and B evolve from a common ancestor C that is not 
A or B, do we expect the same pattern frequencies? The bulk of the proof is 
devoted to answering this question by showing that the pattern probabilities do 
not depend on the location of C. 

Let Si] 1 < i < k be the sequences at the leaves of a phylogenetic tree T with 
root vertex v. The sequences are related by a TKF91 process on the tree. The 
vertices of the tree are labelled Vi such that the first k of them are leaf vertices 
respectively associated with the sequences S^. Each vertex Vi has associated with it 
a time parameter ti, which is the duration for which the TKF91 process operates on 
the edge leading to i>, from its immediate ancestor a(vi). 

Consider any two sequences, say Si and 5*2, and let Vi be their most recent common 
ancestor. That is, the ancestral sequence S at the root v evolves into a sequence 
Si along the path from v to Vi, and then two copies of Sj evolve independently to 
sequences Si and S2, respectively, along paths from Vi to V\ and Vi to v%. It is 
assumed that the unknown ancestral sequence at the root has already evolved for an 
infinite duration. It implies that the sequence Si has evolved for an infinite duration. 
Therefore, the homology relationship between Si and S2 does not depend on the 
location of v, but perhaps depends on the location of Vi on the path between V\ and 
v 2 . Therefore, the analysis of the homology structure of Si and S2 may be performed 
by assuming that Vi is the same as the root v. In other words, we will first analyse only 
two sequences evolving from their most recent common ancestor, which is assumed 
to have evolved for infinite time. 

The restricted alignment and its blocks 

Consider the alignment of Si and S2 that is obtained by discarding all other rows 
in the multiple alignment (sequences S 3 to S^), and then discarding the columns 
containing only gaps in the first two rows (rows for Si and S2). The most recent 
common ancestor of Vi and v% is v. Since we have discarded other sequences, and 
pruned the tree, we can assume that v and v\ are separated by time ti, and v and v-i 
are separated by time ti- We are interested in showing that the probabilities of the 
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patterns of interest are functions of t\ + t 2 , that is, they do not depend individually 
on ti or t 2 . 

An alignment of two sequences can be thought of as a sequence of blocks where a 
block consists of a pair of homologous characters followed by non-homologous char- 
acters, and terminated by a pair of homologous characters of the next block. For 
example, a possible alignment of Si and S 2 , ignoring the beginning and end is 

(Si4 ####-#-# #\ 

\s 2 # #-#-###- 4) 

The alignment shows four types of blocks. 

A= v# #){#-#){#-# *) v# # - #J 

Homologous characters in a column imply that there was a character in ancestral 
sequence S that survived in both Si and S2. We are interested in computing the 
probabilities p(A), p(B), p{C) and p(D) of observing, respectively, blocks of types A, 
B, C and D. 

The classes of events that result in the blocks of types A, B, 
C and D. 

Computing the probabilities involves summing over infinitely many possible events in 
the history that result in a block of a certain type. For example, a block of type A is 
a result of historical events of the type 

51 # - #\ 

S # # j # ;*>0 

5 2 # - #/ 

Here we have % characters in the ancestral sequence S that died along the edges v — v\ 
and v — v 2 . Of course, we do not know S. So looking at the alignment of Si and 
S 2 , we do not know how many characters died resulting in a block of type A. So the 
probability of observing A would be the sum of probabilities over all non-negative 
values of i. 

Similarly, there are two types of events that result in a block of type B. They are 

(Si # - # - #\ (Si # - # - #\ 

E l B = \S # # J # #* # and ^ = [S # # l - P # 
\5 2 # - - - #/ V s2 # - " " #/ 
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where % > 0,j > 0. Therefore, we write p(B) = p{E B ) + p(E B ). Here E B is the 
possibility that the middle character in S\ was homologous to a character in S, but 
along the branch from v to v 2 , the character in S died. It is also possible that the 
middle character in Si in block B was created in a birth event during the evolution 
from S to St. This is the possibility E\. 

The following four types of events result in a block of type C . 

(St # - # - - - #\ 
E},= ( 5 # #< # #>' # # fc #) ;i>0,j>0,fc>0 



= ( 5 # #* # - # fc # 1 ;i > 0,j > 0,fe > 



Si 


# 




# 








s 


# 




# 




# 




S 2 


# 








# 




Si 


# 




# 








s 


# 


#* 


# 


# j 




# fc 


s 2 


# 








# 




Si 


# 




# 








S 


# 


#* 




# j 


# 


# fc 


s 2 


# 








# 




Si 


# 




# 








S 


# 












s 2 


# 








# 





5 # # l - #^ # # fc # ;i>0,j>0,fc>0 



^ = | 5 # # i _ # i _ # fe # | ;j>0,j>0,fc>0 

Observe that they are disjoint classes of events, so we write 

p{C) = p{E l c ) + p{El) + p{El) + p{E" c ) 

Classes of events resulting in blocks of types D, are similar to those that produce 
the pattern C. But there is no distinction between types C and D when j = in Eq. 
In the language of |14j . C and -D have the same homology structure. Therefore, we 
compute 

p(C V D) = p{C) + p{D) - p(E*; j = 0) 

The probability computations involve summations over i, j, k varying over all non- 
negative integers, but the summations can be easily computed since they happen to 
be geometric series. 



The Markov structure along the sequences 

How do we compute the probability of a class of events contributing to a block? The 
probability of an alignment of sequences on a tree is the product of the probabilities of 
the pairwise alignments on the edges of the tree. Also, an alignment of two sequences 
has a Markov structure along the sequences provided one of the sequences is the 
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ancestor of the other, see (THj for details. We use this idea to compute probabilities 
of classes of events using the Markov structure on v — V\ and v — v 2 . 

The transition matrix for the Markov chain for a two sequence alignment, assum- 
ing one of the sequences to be the ancestor, is given by 







H 


D 




end 






H 


/(l- 


A/3)(A) e -^ 


(l-A/3)(A)(l- e -^) 


Xf3 


(1-A/3)(1 


An 


\ 


D 


(1- 




(1 -«)(£)(! - e -*) 


K 


(l-/c)(l- 


- -) 




I 




A/3)(J)e-" 


(l-A/3)(A)(l_ e -^) 


X(3 


(1 - A/3)(l 


An 





(9) 



where H (homology), D (deletion) and / (insertion) denote the three states 
# 



and | ), respectively, and 



n(t) = k = 1 



1 - e - ^ 



# 
# 



Now on we denote /3(tj) by e Mti by Xj, by for i G {1, 2}, and f3(t) 
P(ti + 1 2 ) by p, e-^ by X = X X X 2 , and n(t) = Kfa + 1 2 ) by «. 



Computing 

Using the transition matrix in Q, we can write 



l-A/3 x )(l-A/3 2 ) ( 
+ (l-AA)(l-A/3 2 ) ( ^ ) (1-Xi)(l-X 2 ) 



x 



i-1 



A 



x (l- Kl )(l-« 2 ) (jJX^ 

(l-AA)(l-A/j 2 )(^)e-^+ t2 ) 
l-(A )( i_ Kl ) ( i_ K2 ) ( i_ Xl )(l-X 2 ) 



(10) 
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where the first term corresponds to i — 0. 
One can verify that 



(1-AAX1-A&) (l-A/3i)(l-A^) 



1 - (A)(l - - K2 )(l - X0(1 - X 2 ) 1 - A/X/3^2 

= l-A/3 (11) 

Therefore, 

p(^) = (1 - A/3) e-" (12) 
Thus, as required, p(A) depends only on t — ti+t 2 but not on t\ and t% individually. 

Computing p(B) 

We compute p(E^) as a product two factors Fi and F 2 which, respectively, correspond 
to the transitions shown below. 

v # # l # ;*>0and L # # ; j > 

2 # - -/ \2 - - #/ 



Therefore, 



F = (1-A&X1-A&) ( ^ )X!(1-X 2 ) 

+ (l-A/3 1 )(l-A/3 2 )(-)(l-X 1 )(l-X 2 ) 



-l 



x ^f(l-/ tl )(l-/ t2 )(-)(l-X 1 )(l-X 2 )) 
x (l-zcxXl-zca) -X 2 ) 



and, 



= (l-A^Q^X^l-X,) (13) 



(l-AAXl-KaXjWk 

F2 = t^a^ (14) 
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Therefore, 



(1 - \f>){\ - Aft)(l - K2 )(i) 2 A'f.Y 2 (l - X 2 ) 

v(Eb) = r^vftft (15) 



Next, the contribution from the class of events E B is computed. 
P(^) = (Wl-A/3 2 ) + 

(l-Aft)g)IA 

l-A/i/3i/3 2 v ; 

Adding p(E B ) and p(E B ), and after simplification, we have 

p(S) = (1 - \(3)\(3 e"* (17) 

Again, p(B) depends only on t = t± + t 2 . 

Computing p(C V D) 

This computation is performed on similar lines as the computation for p(A) and p{B) 
except that the events in Eq when j = also contribute to p(Ef)). Therefore, 

p(CVD) = p(£*) + p{E l D ) + p{El) + p(El) + p(E 3 c ) + p(E 3 D ) 

+ p(E' c ) + p(E' D ) - p(E 4 c ; j = 0) (18) 

Skipping the details, we state below different terms that contribute to p(C V D). 

p{Eh)+p{Ei) = (19) 



■2^ ,,, ,-2 , _ v / ^ , * [X 1 {l-X 2 )K 2 +X 2 (l-X 1 )K l 



p(E' c )+p(Ej ) ) = (l-\py - e 



/V V/V V 1 _ A///3i/3 2 



(20) 



p(^)+p(^) 



1 „— /rf / W/1 V \ V ... i/1 v N v 



= (1 - A/3) - e""* - ((1 - Xi)X 2 «i + (1 - X 2 )X lK2 ) 



+ (i a/3) f A l c-^ f X ) M^il ~ W + V^i(l - A/3i) 



ii 



= (1 - A/5) 2 e""* (1 - X 1 )(l - X 2 )k,k 2 



^ f±\ n _ y.Vi _ v f 



+ 2(1 - A/3) ( - j *)(! - X 2 ) x ^ ^ _ ^ 

+ (l-AJ)(^e-' rt ^)(l-A- J )(l-A- 2 ) 



A 4 / V/ 1 

A/?i(l - A/3 2 )(l - Ki )k 2 + A/3 2 (l - A/?x)(l - 
X 1 - A/zAft (22) 

Adding Equations (USD, ((201), (ED), and ((22), and simplifying, we get 
p(CVD) = (l-A^Q)(l- e ^-^)Q)e^ 



(1 - e - "* 


- / 






x 


)' 






- -e-^ - 


A/3 



-/it 



—/it 

(23) 



Identifying T, A^ and //tj 



The pattern probabilities obtained above are sufficient for the estimation of model 
parameters and the scaled edge lengths. From Equations (|12|) and ([17]) we can write: 

A\ p(A) p{Af 



-\e "* 



fij l-A/3 p(A)-p(S) 

12 



Substituting Xf3 and e M * in Equation (J23j) we can express and e M * (and 

hence /it and At) in terms of p(A), p{B) and p(C V D). 

Repeating this exercise for all pairs of leaves and then applying Lemma ^ we see 
that the tree topology and the scaled time parameters on all edg function of 

the probabilities of blocks of type A, B and C V D in all pairwise alignments. This 
completes the proof of Theorem |3J 

4 Invertibility based on sequence lengths 

The main result of this section is 

Theorem 5. Given a collection of sequences that have evolved under a TKF91 process 
with parameters X and fj, on a phylogenetic tree, the phylogenetic tree and the scaled 
edge lengths are uniquely determined by the collection of sequence lengths provided the 
sequences are sufficiently long. 

The idea of the proof is that if a sequence of length Xq evolves under a TKF91 
model into a sequence of length X in time t, then assuming X to be the expected 
length of the sequence at time t allows us to estimate fit. To justify this, we demon- 
strate that the relative standard deviation of the length distribution goes to zero as 
the initial length tends to infinity. 

Let P(X,t) be the probability that a sequence evolving under the TKF91 model 
has length X at time t. The differential equation for P(X,t) is 



with the initial conditions P(X = X ,0) = 1 and P(X ^ -^o,0) = 0. Observe that 
this is slightly different from the standard equation for the simple birth-death process 
because of the immortal link, which has death-rate. Let Mi(t) and M2(t) denote 
the first and the second moments of P(X,t). They are calculated below. 

Multiplying both sides of Equation (J2U) by X, and summing over X, we get 



dP(X, t) 



\XP(X-l,t)+n(X + l)P(X + l,t)-\(X + l)P(X, t) 



(iXP(X,t) (24) 



dt 



dM x 



(A 



fi)M 1 + X 



(25) 



dt 



with the initial condition 



o 



(26) 



This has the solution 



M l {t) 



((X- fx)X + X)e^-^ - X 
X — n 




(27) 
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Multiplying both sides of Equation (J24j) by X 2 and summing over X, we get the 
differential equation for M 2 (t). 

^ = 2(A - /i)M 2 + (3A + f i)M 1 + A (28) 
at 

with the initial condition 

M 2 (0) = X 2 (29) 

This has the solution 

M (A (A + /i)A _ ((A-^)Xo + A) (3A + /X) (A _ M)t 

/^ 2 , ((A-/i)Xo + A) (3A-Km) (A + /i)A ^ 2(A _ M)t 

V U^W V^WJ ( } 

Therefore, the variance is given by 

a 2 = M 2 -M 2 = (A + ^)(A-^)X + A 2 p2(A _^ 



(A-/i) 



2 



(A + /i)((A-/i)X + A) (a _ m) * V 

(A-/,) 2 + (A-/,) 2 (31) 

The expected length L at equilibrium is given by y^a?]? ^° we calculate a 2 /L 2 as 
follows. 

a 2 = (A + /i)(A - //)X ^(x-rit _ e (A- M )t) + ^( e 2(A- M )t _ A + / i c (A- At )t + ^ 

(32) 

This tends to as X/fi tends to 1. Therefore, for sufficiently long sequences the tree 
and scaled time parameters can be estimated as follows. 
First we estimate -. 

- = — (33) 
\i L+l V ; 

Now for any two sequences Si and S 2 , with lengths Xi and X 2 , respectively, 
fit = fj,(ti + t 2 ) (which is the scaled time separating Si and S 2 ) is estimated from 
Equation (|2T|) by substituting X = Xi and Mi(t) = X 2 . Repeating this for all pairs 
of sequences, and applying Lemma [TJ the tree T and the scaled time parameters Xti 
and [iti are uniquely recovered. 

Remark. A weaker form of the results in the appendix is implicitly assumed here: 
for every pair of sequences, we treated one of them to be the ancestor of the other. 
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5 Discussion 



This paper gives a mathematical analysis of the TKF91 model on a tree. We present 
two uniqueness results for sufficiently long sequences: the first one establishes a cor- 
respondence between the distribution of certain patterns in a multiple alignment and 
the phylogenetic tree and scaled time parameters associated with edges of the tree; the 
second result gives a correspondence between sequence lengths and the phylogenetic 
tree along with other model parameters. Results presented here give us consistency 
criteria that the alignments and trees constructed under the TKF91 model from a set 
of sequences must meet. 

It is not likely that the idea of the proofs presented here will be useful as it is for 
building trees. The limiting situation of sequence lengths going to infinity arises si- 
multaneously with the ratio A//i going to one, and we do not know if the real sequences 
can be modelled with the ratio close to one. Nevertheless, the correspondence between 
the homology structures and the trees has useful algorithmic implications. Lunter et 
al. ^1] mention that a good method of sampling multiple alignments under a fixed 
tree is currently missing. The correspondence between the homology structures and 
the trees could be useful in an alignment sampling procedure. Many alignments could 
be discarded purely on the basis of pattern probabilities without actually evaluating 
the alignment likelihoods. There are two more directions in which future progress 
may be made. For substitution models, Erdos, et al. [T7j have studied the problem 
of finding a lower bound on sequence lengths that would be sufficient for the correct 
reconstruction of the tree with a high probability. Interestingly this bound is not 
large: for n taxa, sequence lengths of the order of a power of logn are sufficient. It is 
conceivable that such a result exists for TKF type processes. Another direction is to 
prove similar results for the variants of TKF model, such as the model proposed by 
Metzler et al. jTH] i n which A and \x are the same and the variants in which longer 
segments are inserted or deleted, (for example, the TKF92 model and the model of 

Appendix: General Homology Structures 

A purpose of this appendix is to demonstrate that the likelihood of a general homology 
structure on sequences related by a TKF91 process on a tree does not depend on the 
location of the root on the tree. 

Let [i,j] denote the set of integers from i to j. A rooted phylogenetic X-tree T 
is a rooted tree with leaf set X = {xf, i G [l,p]}, root vertex x of degree at least 
2, and all other vertices (denoted by Xi\% G [p + l,p + q\) of degree at least 3. Let 
V and E denote, respectively, the vertex set and the edge set of T. The edges of 
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T are directed away from the root. Associated with each vertex x± is a sequence Si 
of length Li + 1 over a finite alphabet E. The sequences are related to each other 
by a TKF91 process with parameters A, fi and time parameter t e for each edge e of 
T. The sequences Si;i G are observed sequences, and remaining sequences are 
unobserved. 

Let Sij denote the j-th character in Si. A segment of Si from s*j to is denoted by 
Si\j,k}. The characters s$j and are said to be homologous (denoted by s»j ~ sjy) 
if Xi and have a common ancestor Xh such that a character in SVi survives as 
characters and under the TKF91 process. For notational convenience we think 
of the immortal link as the zeroth character in all sequences, and write s^o ~ Sjo 
for all i and j. Thus each Si has Lj characters other than the immortal link. By 
definition, each character in each sequence is homologous to itself. Therefore, ~ is 
an equivalence relation on {s^-; i G [0,p + q]}. The collection of equivalence classes 
(or their restriction to a set of sequences {S^ iG/C [0,p + q]}) is called a homology 
structure on the sequences [HJ, and is denoted by H* (or Ti 1 ). 

For a character '-' not in £ (called the gap character), let E' = E U { — }. An 
alignment of Sf, i G / C [0,p + g] is a collection {Ri = (r^); i G /} of sequences over 
£' such that 

1. deleting gap characters from Ri gives Si 

2. all i?i have the same length 

3. for each j, there is an i such that is a non-gap character 

4. for each j, non-gap characters constitute an equivalence class of Ti 1 . 

Each alignment corresponds to a unique homology structure, so each homology struc- 
ture is represented by a representative alignment. 

For a directed edge (xi,Xj) of T let 7Y lJ denote the homology structure obtained 
by restricting H* to the sequences Si and Sj. Let P(TC*) denote the probability of 

n*. 

Proposition 6. 

P(H*)=P(L ) P ( nij \ L i) ( 34 ) 

(xi,xj)eE 

where P(L ) is the equilibrium probability of L given by 
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Proof. This follows from the fact that the sequences evolve independently along the 
edges of the tree. □ 



Let 7i denote a homology structure on the observed sequences. Its probability 
P{H ohs ) is given by the following 

Proposition 7. 



where the summation is over all homology structures 7i* on i G [0,p + q] whose 
restriction on S^ i G [l,p] is Ti obs . 

Proof. The R.H.S. is a sum of the probabilities of all possible histories that result in 



In this appendix we want to prove that P(7i obs ) does not depend on the location 
of xq. The idea is to apply Felsenstein's pulley principle [I]. But to be able to do 
that, we first have to prove the pulley principle when there are just two observed 
sequences. 

Let S^, i G [0,2] be three sequences, S being the ancestor that evolves to Sj in 
time ti, for i G [1, 2], and let t — t\ + t 2 . 

A homology structure on two sequences can be decomposed into blocks of charac- 
ters bounded on the left by homologous characters. In the case of the leftmost block, 
the immortal links constitute the left boundary. This is formally defined below. 

We introduce the term related only for the analysis of Sq, Si and S 2 : two characters 
Si a and Sjb are said to be related if they are descendents of the same character in Sq. 
By definition, each character in So is its own descendent and parent. A set of segments 
{S [a, a + u], Si[b,b + v], S 2 [c, c + w]} is called a block of TC* if the following conditions 



1. s 0a ~ sib ~ s 2c (Note: by convention, s i0 ~ Sj .) 

2. together the three segments are closed under relatedness. 

A two element subset {Sj[a, a + u], Sj[b, b + v}} of a block is called a block of Tt^ . 
The event that the set {So [a, a + u], S\[b,b + v], S 2 [c, c + w}} is a block is denoted 
by B*(a, u, b, v, c, w). The event that the set {Si[a, a + u], Sj[b, b + v]} is a block is 
denoted by B l \a,u,b,v). Probabilities of these events are computed by summing 
over the probabilities of all possible contributing events. 




(35) 



the homology structure Ji 



□ 



hold 
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Lemma 8. Fori G [1,2], 



P(B 0t (a,m,b,n)\s ib ~ s 0a , L > a + m) 
= E ( T ) (/iAr ( m - e ) (1 - /^r~ e (AA) n - m+e (l - W 



m— e+1 



(36) 



Proof. This follows from the solutions of the TKF91 model given in Lemma|21 A term 
in the above summation gives a contribution to the probability when e characters in 
So[a,a + m] have no descendents in Si. The remaining m — e + 1 characters in 
So [a, a + m] have n + 1 descendents in Si (counting Sj&, which is a descendent of s 0a ). 

Therefore, there are m — e + 1 possible groups of siblings in Si, and ( U j ways 

of making groups. Each group contributes a factor (1 — A/3j). The probability that 
a character in So has at least one descendent in Si is 1 — /i/3j. Therefore, all groups, 
except the first group that corresponds to the descendents of So a , contribute a factor 
(1 — n(3i). A group of k > siblings contributes a factor (A/3j) fc_1 . Thus all groups 
of siblings together contribute the factor (\j3i) n ~ m+e . Note that there is no change in 
the formula even when a = b = 0. □ 

The above probability does not depend on a and b. It is simply the probability 
that ri + 1 characters in the ancestral sequence have m + 1 descendents after time ti, 
with the first character in the ancestor surviving after time ti. Let this probability 
be denoted by P(n — » m, ti). 

Corollary 9. Fori G {1,2}, 

(X//j,) n P(n -> m, ^) = (A//i) m P(m -> n, ^) (37) 

Proof. When the probabilities in the above equation are written using Lemma |Hl the 
coefficients of (1 — fi/3i) k (l — \(3i) k on the two sides of the above equation are equal 
for each k. □ 
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Corollary 10. 

oo 

^(A//i) n P(n -> m^t^Pin -> m 2 ,t 2 ) 



n=0 



= ^(A//i) mi P(m! -> n, ti)P(n -> m 2 , t 2 ) 

n=0 

= (A//i) mi P(m! ->m 2 ,ti+t 2 ) 

= (A/^) mi E ( 7 ) ( m7- e ) (/i/3)6(1 ~ ^) mi_£ (^) m2 " mi+e (l - A/3) mi " e+1 

= (\/ij) mi P(B 12 (b,m u c,m2)\s lb ~ s 2c , Lj > 6 + mi) (38) 

Proof. The second and the third lines follow from Corollary El The forth line follows 
from LemmaEland the fact that the expression in the third line is a function of £i+t 2 . 
If su ~ s 2c then there is some character s a in So such that Si& ~ s 0a ~ s 2c . This 
implies the last line. □ 

Let N l i(a, m, b, n) denote the event that B l3 (a, m, b, n) AND Si P Sj q for a < p < 
a + m + 1 and 6<g<6 + n + l. Informally speaking, it is an event that represents 
a block with only one pair of homologous characters. 

The following lemma relates the probabilities of P-type and iV-type events. 

Lemma 11. 

P(B 12 (b ,m,c ,n)\s lbo ~ s 2co , L x > b + m) 

= P(N 12 (b ,m,c ,n)\s lbo ~ s 2co , L x >b + m) 

k 

+ E X fe JJP(A^ 12 (6 i ,m i ,c i ,n i )|s lfei ~ s 2Ci , Lx > 6i + m 4 ) 

l<k<min(n,m) i=0 

where Ci = Cj_i + n^-i + 1 and 6j = + m.j_i + 1 /or z > 1. 
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Proof. We have considered all possible ways in which a B(. . .)-type event is parti- 
tioned into N(. . .)-type events. The index k denotes the number of pairs of homol- 
ogous characters in Si[bo, bo + m] and S 2 [co, cq + n] other than the homologous pair 
sib ~ s 2co . Therefore, each value of k partitions the B(. . .)-type block in N(. . .)-type 
blocks. Histories contributing to each N(. . .)-type block are independent. The term 
corresponding to the k = case is written separately as the first term because that 
is what we would like to compute. □ 

Lemma 12. 



P(N 12 (b, m, c, n)\s lb ~ s 2c , L\ > b + m) 

= E ( 7 ) ( m - e ) ~ x ~ /^r~ e ( w~ m+e (i - m m ~ e+1 

(40) 

Proof. It follows from Corollary El that the LHS of Equation (jHSj) is a function of t% + 
t 2 . We prove by induction onm + 11 that P(N 12 (bo, m, Co, n)\sib ~ S2 C0 , L\ > &o + m ) 
is a function of t\ +t 2 . Probabilities p(A), p(B) and p(C V D) computed in Section^] 
are functions of ti + t 2 , and correspond to m + n = 0, 1 and 2, respectively, and are 
used as the base case of induction. Let P(N 12 (b , m, c , n)\sib ~ s 2co , L\ > b + m) 
be a function of ti + t 2 for m + n < r, where r > 2. Let m + n = r in Equation ()39j) . 
Since rrii + rii < r for all terms in the summation, and the L.H.S. of Equation 
is a function of t\ + t 2 , the only remaining term (the first term on the R.H.S. of 
Equation (ffiljO is a function of ti + t 2 . 

Once the L.H.S. of Equation (jlUj) is proved to be independent of the root location, 
the R.H.S. is written by treating X\ as the ancestor and arguing in the same manner 
as in Lemma |H1 □ 

Lemma 13. Let Ti 12 be a homology structure on S\ and S 2 . Then p(H 12 ) is a function 
ofh+t 2 . 

Proof. A homology structure has a natural decomposition into blocks each of which 
has only one pair of homologous characters. Events in distinct blocks are independent. 
Thus the result follows from Lemma IT2*1 

□ 

Theorem 14. p(7i ohs ) is independent of the root location. 

Proof. This follows from Propositions El and and Lemma El D 
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Remark The proofs presented here (in particular the proof that P(N l2 (. . .) is a 
function of t% + t^) appear somewhat convoluted. We could have written results 
similar to Lemma |H1 and Corollary El for the probabilities of the N(. . .) type events. 
For example, the following corollary is similar to Corollary El 

Corollary 15. Fori e {1, 2}, 

(\/n) n P(N 0i (a, n, b, m) \s lb ~ s 0a , L >a + n) 

= (\/i2) m P(N m (a jm ,b,n)\s lb ~ s 0a , L > a + m) (41) 

But this is not useful to claim that P(N 12 (. . .)) is a function of ti + t 2 . For 
example, we cannot prove something like Corollary I1UI for P(N 12 (. . .)) because there 
may be a situation such as su s 2j -, su ~ sok and soft <*> s^j- As a result the second 
step of Corollary ITU1 (derivation of the 3rd line from the 2nd) fails for P(N l2 (. . .)). 
In fact, Equation (|41|) cannot be regarded as a "detailed balance condition" even if it 
looks like one. If a Markov process has transition probabilities given by Pjj and if 7Tj is 
the stationary distribution, then the process is reversible if and only if itiPij = ftjPji- 
But in our problem, P(N 12 (. . .)) is not a probability of transition from state 1 to 
state 2. It is merely a probability of observing certain configuration that is defined by 
parameters that take values 1 and 2. A similar condition has been used as a detailed 
balance condition in [TH] (see Equation (30) on p539). 

Theorem El has been believed to be true by researchers, and has been used in 
[T3j to test correctness of implementation of algorithms. In general such a result 
should not be taken for granted as a consequence of reversibility of the model and 
the Pulley Principle of Felsenstein. In fact, a result such as Theorem fT4l implies the 
pulley principle. 
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